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Abstract. We consider the problem of jointly estimating multiple precision matrices from (aggregated) 
high-dimensional data consisting of distinct classes. An ^ 2 -penalized maximum-likelihood approach is em¬ 
ployed. The suggested approach is flexible and generic, incorporating several other ^ 2 -penalized estimators 
as special cases. In addition, the approach allows for the specification of target matrices through which 
prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional 
settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the 
constituent classes are believed to chiefly share the same structure while potentially differing in a number of 
locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical 
interpretation of precision matrices with the proposed estimator then serving as a basis for integrative or 
meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by 
data sets and/or (subtypes of) diseases. The performance of the proposed estimator in the graphical mod¬ 
eling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by 
the differential network modeling of 11 large-scale diffuse large B-cell lymphoma gene expression data sets. 
The estimator and its related procedures are incorporated into the R-package rags2ridges. 

Keywords: Differential network estimation; Gaussian graphical modeling; Generalized fused ridge; High¬ 
dimensional data; ^ 2 -penalized maximum likelihood; Structural meta-analysis; Subclass data 


1. Introduction 


High-dimensional data are ubiquitous in modern statistics. Consequently, the fundamental problem of 
estimating the covariance matrix or its inverse (the precision matrix) has received renewed attention. Suppose 
we have n i.i.d. observations of a p-dimensional variate distributed as S). The Gaussian log-likelihood 

parameterized in terms of the precision matrix U = is then given by: 


( 1 ) 


£{n;S) oc ln|0| -tr(Sn), 


where S is the sample covariance matrix. When n > p the maximum of Q is attained at the maximum 
likelihood estimate (MLE) = S~^. However, in the high-dimensional case, i.e., when p > n, the 

sample covariance matrix S is singular and its inverse ceases to exist. Furthermore, when p ^ the sample 
covariance matrix may be ill-conditioned and the inversion becomes numerically unstable. Hence, these 
situations necessitate usage of regularization techniques. 

Here, we study the simultaneous estimation of numerous precision matrices when multiple classes of 
high-dimensional data are present. Suppose is a realization of a p-dimensional Gaussian random vector 
for i = l,...,ng independent observations nested within g = 1,...,G classes, each with class-dependent 
covariance Hg, i.e., y^g ~ Np{pig, Tig) for each designated class g. Hence, for each class a data set consisting 
of the Ug X p matrix Yg = [y^g,..., YnggV is observed. Without loss of generality fig = 0 can be assumed 
as each data set Yg can be centered around its column means. The class-specific sample covariance matrix 
given by 




i=l 


= _Y^Y 

I if. 


* Shared first authorship. 

^ Principal corresponding author. 
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then constitutes the well-known MLE of Sg as discussed above. The closely related pooled sample covariance 
matrix 


( 2 ) 


G rig G 

= — E E yrgyJg = — E 


3=1 i=l 


9=1 


where n, = X]g=i oft-used estimate of the common covariance matrix across classes. In the high¬ 

dimensional case p > n, (implying p > Ug) the Sg and S, are singular and their inverses do not exist. Our 
primary interest thus lies in estimating the precision matrices Hi = ,..., Hg = Sg , as well as their 

commonalities and differences, when p > n,. We will develop a general i 2 -peTiBlized ML framework to this 
end which we designate targeted fused ridge estimation. 

The estimation of multiple precision matrices from high-dimensional data classes is of interest in many 
applications. The field of oncogenomics, for example, often deals with high-dimensional data from high- 
throughput experiments. Class membership may have different connotations in such settings. It may refer 
to certain sub-classes within a single data set such as cancer subtypes (cancer is a very heterogeneous disease, 
even when present in a single organ). It may also designate different data sets or studies. Likewise, the class 
indicator may also refer to a conjunction of both subclass and study membership to form a two-way design 
of factors of interest (e.g., breast cancer subtypes present in a batch of study-specific data sets), as is often 
the case in oncogenomics. Our approach is thus motivated by the meta-analytic setting, where we aim for 
an integrative analysis in terms of simultaneously considering multiple data (sub-)classes, data sets, or both. 
Its desire is to borrow statistical power across classes by effectively increasing the sample size in order to 
improve sensitivity and specificity of discoveries. 


1.1. Relation to literature and overview. There have been many proposals for estimating a single 
precision matrix in high-dimensional data settings. A popular approach is to amend Q with an ^i-penalty 
[01211111149]. The solution to this penalized problem is generally referred to as the graphical lasso and it 
is popular as it performs automatic model selection, i.e., the resulting estimate is sparse. It is heavily used 
in Gaussian graphical modeling (GGM) as the support of a Gaussian precision matrix represents a Markov 
random field |H]. 

The £i-approach has been extended to deal with more than a single sample-group. Guo et al. |2l] have 
proposed a parametrization of class-specific precision matrices that expresses the individual elements as a 
product of shared and class-specific factors. They include ^i-penalties on both the shared and class-specific 
factors in order to jointly estimate the sparse precision matrices (representing graphical models). The penalty 
on the shared factors promotes a shared sparsity structure while the penalty on the class-specific factors 
promotes class-specific deviations from the shared sparsity structure. Danaher et al. m have generalized 
these efforts by proposing the joint graphical lasso which allows for various penalty structures. They study 
two particular choices: the group graphical lasso that encourages a shared sparsity structure across the class- 
specific precision matrices, and the fused graphical lasso that promotes a shared sparsity structure as well 
as shared precision element-values. A Bayesian approach to inferring multiple sparse precision matrices can 
be found in Peterson et al. |d2) . 

While simultaneous estimation and model selection can be deemed elegant, automatic sparsity is not 
always an asset. It may be that one is intrinsically interested in more accurate representations of class- 
specific precision matrices for usage in, say, covariance-regularized regression |46j or discriminant analysis 
[33| . In such a situation one is not after sparse representations and one may prefer usage of a regularization 
method that shrinks the estimated elements of the precision matrices proportionally. In addition—when 
indeed considering network representations of data—the true class-specific graphical models need not be 
(extremely) sparse in terms of containing many zero elements. The £i-penalty is unable to retrieve the 
sparsity pattern when the number of truly non-null elements exceeds the available sample size [42) . In such 
a situation one may wish to couple a non-sparsity-inducing penalty with a post-hoc selection step allowing 
for probabilistic control over element selection. We therefore consider ^2 or ridge-type penalization. 

In Section the targeted fused ridge estimation framework will be presented. The proposed fused ^ 2 - 
penalty allows for the simultaneous estimation of multiple precision matrices from high-dimensional data 
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classes that chiefly share the same structure but that may differentiate in locations of interest. The ap¬ 
proach is targeted in the sense that it allows for the specification of target matrices that may encode prior 
information. The framework is flexible and general, containing the recent work of Price et al. |33j and van 
Wieringen and Peeters [42] as special cases. It may be viewed as a £ 2 -alternative to the work of Danaher 
et al. [ 15 . The method is contingent upon the selection of penalty values and target matrices, topics that 
are treated in Section Section [^ then focuses on the graphical interpretation of precision matrices. It 
shows how the fused ridge precision estimates may be coupled with post-hoc support determination in order 
to arrive at multiple graphical models. We will refer to this coupling as the fused graphical ridge. This then 
serves as a basis for integrative or meta-analytic network modeling. Section [5 then assesses the performance 
of the proposed estimator through extensive simulation experiments. Section ^illustrates the techniques by 
applying it in a large scale integrative study of gene expression data of diffuse large B-cell lymphoma. The 
focus is then on finding common motifs and motif differences in network representations of (deregulated) 
molecular pathways. Section concludes with a discussion. 

1.2. Notation. Some additional notation must be introduced. Throughout the text and supplementary 
material, we use the following notation for certain matrix properties and sets: We use A 0 and B ^ 0 to 
denote symmetric positive definite (p.d.) and positive semi-definite (p.s.d.) matrices A and B, respectively. 
By M, IR_|_, and K++ we denote the real numbers, the non-negative real numbers, and the strictly positive 
real numbers, respectively. In notational analogue, S^, and are used to denote the space oi p x p 
real symmetric matrices, the real symmetric p.s.d. matrices, and real symmetric p.d. matrices, respectively. 
That is, e.g., 5^_|_ = {X e : X = X^ AX 0}. Negative subscripts similarly denote negative reals and 
negative definiteness. By A > B and similar we denote element-wise relations, i.e., (A)_,q > (B)jg for all 
{j,g). Matrix subscripts will usually denote class membership, e.g., Ag denotes (the realization of) matrix 
A in class g. For notational brevity we will often use the shorthand {Ag} to denote the set {Ag}^^]^. 

The following notation is used throughout for operations: We write diag(A) for the column vector com¬ 
posed of the diagonal of A and vec(A) for the vectorization operator which stacks the columns of A on top 
of each other. Moreover, o will denote the Hadamard product while 0 refers to the Kronecker product. 

We will also repeatedly make use of several special matrices and functions. We let Ip denote the {p x p)- 
dimensional identity matrix. Similarly, Jp will denote the (p x p)-dimensional all-ones matrix. In addition, 
0 will denote the null-matrix, the dimensions of which should be clear from the context. Lastly, || • |j|. and 
1[ • ] will stand for the squared Frobenius norm and the indicator function, respectively. 


2. Targeted fused ridge estimation 


2.1. A general penalized log-likelihood problem. Suppose G classes of (ug x p)-dimensional data exist 
and that the samples within each class are i.i.d. normally distributed. The log-likelihood for the data takes 
the following form under the additional assumption that all n, observations are independent: 

(3) /:({IIg};{Sg}) cx ^ng{ln|flg| -tr(Sgllg)}. 

g 


We desire to obtain estimates {Ilg} G of the precision matrices for each class. Though not a requirement, 
we primarily consider situations in which p > Ug for all g, necessitating the need for regularization. To this 
end, amend ([^ with the fused ridge penalty given by 

(4) /™ ({IIg};{Ag,gJ,{Tg}) = ^ ^ | | fig - Tg | | ^ + ^ ^ | | (fig, - Tg J - (fl g, - Tg J | | ^ , 

9 91,92 


where the Tg € 5^ indicate known class-specific target matrices (see also Section 3.31, the A 


99 


^++ 


denote class-specific ridge penalty parameters, and the Ag 


are pair-specific fusion penalty parameters 


subject to the requirement that Ag,g 2 = Agjg,. All penalties can then be conveniently summarized into a 
non-negative symmetric matrix A = [Ag,g 2 ] which we call the penalty matrix. The diagonal of A corresponds 
to the class-specific ridge penalties whereas off-diagonal entries are the pair-specific fusion penalties. The 
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rationale and use of the penalty matrix is motivated further in Section 3.1 Combining (§ and @> yields a 
general targeted fused ridge estimation problem: 


(5) 


arg max 




^99 I 

2 I 


O — T 

“ “5 5 11 if 


91,92 


'9192 

4 


(^Sl "^31) (^32 



The problem of © is strictly concave. Furthermore, it is worth noting that non-zero fusion penalties, 
^3132 > ffi 7^ 52 , alone will not guarantee uniqueness when p > n,: In high dimensions, all ridge 

penalties Xgg should be strictly positive to ensure identifiability. These and other properties of the estimation 
problem are reviewed in Section [2.2| 

The problem stated in ([^ is very general. We shall sometimes consider a single common ridge penalty 
Xgg = X for all g, as well as a common fusion penalty Xg-^g,^ = A/ for all class pairs gi 7 ^ g 2 (cf., however. 
Section 3.1 1 such that A = Alp + A/(Jp — Ip). This simplification leads to the first special case: 


arg max 




^ V 
2 ^ 

9 


1^3 ^3|lE 


-wElK^9.-T3J-(f^32-T 



Here and analogous to ©, A controls the rate of shrinkage of each precision Q,g towards the corresponding 
target Tg [35], while A/ determines the retainment of entry-wise similarities between ($7^^ — TgJ and 
(Hgj— Tg^) for all class pairs gi 7 ^ 52 - 

When Tg = T for all g, the problem further simplifies to 


( 6 ) 


arg max 


/:({I7g};{Sg})-^5]||l7g-T||^-^ 


51,52 


^31 ^32 I 


where the targets are seen to disappear from the fusion term. Lastly, when T = 0 the problem © reduces 
to its simplest form recently considered by Price et al. [33]. Appendix studies, in order to support an 
intuitive feel for the fused ridge estimation problem, its geometric interpretation in this latter context. 

2.2. Estimator and properties. There is no explicit solution to ([^ except for certain special cases and 
thus an iterative optimization procedure is needed for its general solution. As described in Section |2.3[ we 
employ a coordinate ascent procedure which relies on the concavity of the penalized likelihood (see Lemma 
in Appendix B.l) and repeated use of the following result, whose proof (as indeed all proofs) has been 
deferred to Appendix |B. 2 1 


Proposition 1. Let {Tg} S and let A G be a fixed penalty matrix such that A > 0 and diag(A) > 0. 
hat rig is 

optimization problem © is then given by 


Furthermore, assume that rig is p.d. and fixed for all g go- The maximizing argument for class go of the 


(7) 


^30 (-^i {^313/30) ~ 


A I 

^30 -*-3 


1 
4 


(®3o '^30 "^30) 


-I 1/2 


+ 


1 


(®3o -^30 "^30 


) 


where 

( 8 ) 


S30 = S,o - E 

37^30 


Ua 


-(f^3-Tg), 


T — T 

-*-50 50 5 


and 


\ _ ^90* 

'^On — 


with Ago, = ^990 denoting the sum of the goth column (or row) of A. 

Remark 1. Defining Tgg = Tgg in Proposition may be deemed redundant. However, it allows us to state 
equivalent alternatives to ([^ without confusing notation. See Section 2.3 as well as Appendix B.2 and 
Section 1 of the Supplementary Material. 

Remark 2. The target matrices from Proposition [^ may be chosen nonnegative definite. However, choosing 
n.d. targets may lead to ill-conditioned estimates in the limit. From a shrinkage perspective we thus prefer 


to choose {Tg} e See Section 3.3 
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Proposition [2 provides a function for updating the estimate of the goth class while fixing the remaining 
parameters. As a special case, consider the following. If all off-diagonal elements of A are zero no ‘class 
fusion’ of the estimates takes place and the maximization problem decouples into G individual, disjoint ridge 
estimations: See Corollary in Appendix B.2 The next result summarizes some properties of Q: 

Proposition 2. Consider the estimator of Proposition and its accompanying assumptions. Let Ctg = 
llg(A, {r2g/}g/^g) fec t/ic prectsion matrix estimate of the gth class. For this estimator, the following prop¬ 
erties hold: 


i. ^Ig >- 0 for all Xgg G K++; 
a. lim Clg = Sg^ ifY.g'^g \g' = 0 P < 

in. lim if \gg> < oo for all g' ^ g; 

Xgg - >- 00 ~ 

iv. lim _(^gj-TgJ= lim _ (fjg^ - TgJ i/ Agjg' < oo/or aH {gi,g^} 7 ^ {gi, 52 }- 


The first item of Proposition implies that strictly positive Xgg are sufficient to guarantee p.d. estimates 
from the ridge estimator. The second item then implies that if ‘class fusion’ is absent one obtains as the 
right-hand limit for group g the standard MLE S”whose existence is only guaranteed when p < rig. The 
third item shows that the fused ridge precision estimator for class g is shrunken exactly to its target matrix 
when the ridge penalty tends to infinity while the fusion penalties do not. The last item shows that the 
precision estimators of any two classes tend to a common estimate when the fusion penalty between them 
tends to infinity while all remaining penalty parameters remain finite. 

The attractiveness of the general estimator hinges upon the efficiency by which it can be obtained. We 
state a result useful in this respect before turning to our computational approach in Section [2.3| 

Proposition 3. Let fig = fig (A, {fig'jg'^^g) be the precision matrix estimate Q for the gth class and define 
[f2g]“^ = Sg. The estimate fig can then he obtained without inversion through: 


fl 


g 



(Sg - AgTg) 


1 


Afllp + ^ AgTgg) 


1/2 




2.3. Algorithm. Equation Q allows for updating the precision estimate fig of class g by plugging in the 
remaining fl'g, g' 7 ^ g, and assuming them fixed. Hence, from initial estimates, all precision estimates may 
be iteratively updated until some convergence criterion is reached. We propose a block coordinate ascent 
procedure to solve ([^ by repeated use of the results in Proposition]^ This procedure is outlined in Algorithm 
By the strict concavity of the problem in ([^, the procedure guarantees that, contingent upon convergence, 
the unique maximizer is attained when considering all fig jointly. Moreover, we can state the following result: 

Proposition 4. The gradient ascent procedure given in Algorithm will always stay within the realm of 
positive definite matrices 

The procedure is implemented in the rags2ridges package within the R statistical language |34j . This 
implementation focuses on stability and efficiency. With regard to the former: Equivalent (in terms of the 
obtained estimator) alternatives to (|^ can be derived that are numerically more stable for extreme values 
of A. The most apparent such alternative is: 

(9) Sg„=Sg„ Tgg=Tg,+ j2^iflg-Tg), and Ag, = ^ . 

g^go 

It ‘updates’ the target Tg instead of the covariance Sg and has the intuitive interpretation that the tar¬ 
get matrix for a given class in the fused case is a combination of the actual class target matrix and the 
‘target corrected’ estimates of remaining classes. The implementation makes use of this alternative where 
appropriate. See Section I of the Supplementary Material for details on alternative updating schemes. 
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Algorithm 1 Pseudocode for the fused ridge block coordinate ascent procedure. 


1 

Input: 


2 

Sufficient data: (Si, ni),..., (Sg, nc) 


3 

Penalty matrix: A 


4 

Convergence criterion: e > 0 


5 

Output: 


6 

Estimates: f2i,..., Hg 


7 

procedure ridgeP.flfsed(Si, ..., Sg, ni,..., no, A, e) 


8 

Initialize: Qg^^ for all 3 . 


9 

for c = 1, 2 ,3,... do 


10 

for 3 = 1, 2,..., G do 


11 

Update := fig (A, fli^\ fl^/l,, .. 

■ ,^G by 0 - 

12 

end for 


13 

ifmaxgj" "-}<ethen 


14 

return (fj..., I^g^) 


15 

end if 


16 

end for 


17 

end procedure 



Efficiency is secured through various roads. First, in certain special cases closed-form solutions to (§ 
exist. When appropriate, these explicit solutions are used. Moreover, these solutions may provide warm- 
starts for the general problem. See Section 2 of the Supplementary Material for details on estimation in these 
special cases. Second, the result from Proposition!^ is used, meaning that the relatively expensive operation 
of matrix inversion is avoided. Third, additional computational speed was achieved by implementing core 
operations in C-t--l- via the R-packages Repp and RcppArmadillo [HI [161 [HI [38] . These efforts make analyzes 
with large p feasible. Throughout, we will initialize the algorithm with = p/ tr(S,) • Ip for all g. 


3. Penalty and target selection 


3.1. The penalty graph and analysis of factorial designs. Equality of all class-specific ridge penalties 
Xgg is deemed restrictive, as is equality of all pair-specific fusion penalties Xg^g^ ■ In many settings, such as the 
analysis of factorial designs, finer control over the individual values of Xgg and befits the analysis. This 
will be motivated by several examples of increasing complexity. In order to do so, some additional notation 
is developed: The penalties of A can be summarized by a node- and edge-weighted graph V = (W, H) where 
the vertex set W corresponds to the possible classes and the edge set H corresponds to the similarities to 
be retained. The weight of node p € IT is given by Xgg and the weight of edge ( 31 , 32 ) & H is then given 
by Agjg^. We refer to V as the penalty graph associated with the penalty matrix A. The penalty graph V is 
simple and undirected as the penalty matrix is symmetric. 


Example 1. Consider G = 2 classes or subtypes (ST) of diffuse large B-cell lymphoma (DLBCL) patients 
with tumors resembling either so-called activated B-cells (ABC) or germinal centre B-cells (GCB). Patients 
with the latter subtype have superior overall survival [T]. As the GCB phenotype is more common than 
ABC, one might imagine a scenario where the two class sample sizes are sufficiently different such that 
i^GCB 3> UABC- Numeric procedures to obtain a common ridge penalty (see, e.g.. Section 3.21 would then be 
dominated by the smaller group. Hence, choosing non-equal class ridge penalties for each group will allow 
for a better analysis. In such a case, the following penalty graph and matrix would be suitable: 


ABC GCB 




A22 


( 10 ) 


V= All 
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Example 2. Consider data from a one-way factorial design where the factor is ordinal with classes A, B, 
and C. For simplicity, we choose the same ridge penalty A for each class. Say we have prior information 
that A is closer to B and B is closer to C than A is to C. The fusion penalty on the pairs containing the 
intermediate level B might then be allowed to be stronger. The following penalty graph and matrix are thus 
sensible: 


( 11 ) 


V = 



A = 


A Ab Aac 
Ab a Ab 
Aac Ab A 


Depending on the application, one might even omit the direct shrinkage between A and C by fixing Aac = 0. 
A similar penalty scheme might also be relevant if one class of the factor is an unknown mix of the remaining 
classes and one wishes to borrow statistical power from such a class. ■ 

Example 3. In two-way or n-way factorial designs one might wish to retain similarities in the ‘direction’ 
of each factor along with a factor-specific penalty. Consider, say, 3 oncogenic data sets (DSi, DS 2 , DS 3 ) 
regarding ABC and GCB DLBCL cancer patients. This yields a total of G = 6 classes of data. One choice 
of penalization of this 2 by 3 design is represented by the penalty graph and matrix below: 


( 12 ) 



A = 


Ads 


A 

Ads 

Ads 

Ast 

0 

0 

Ads 

A 

Ads 

0 

Ast 

0 

Ads 

Ads 

A 

0 

0 

Ast 

Ast 

0 

0 

A 

^DS 

-^DS 

0 

Ast 

0 

•^DS 

A 

-^DS 

0 

0 

Ast 

Ads 

Ads 

A 


This example would favor similarities (with the same force) only between pairs sharing a common level in 
each factor. This finer control allows users, or the employed algorithm, to penalize differences between data 
sets more (or less) strongly than differences between the ABC and GCB sub-classes. This corresponds to 
not applying direct shrinkage of interaction effects which is of interest in some situations. I 

While the penalty graph primarily serves as an intuitive overview, it does provide some aid in the con¬ 
struction of the penalty matrix for multifactorial designs. For example, the construction of the penalty 
matrix (12) in Example corresponds to a Cartesian graph product of two complete graphs similar to those 
given in and 0 - We state that V and A should be chosen carefully in conjunction with the choice of 
target matrices. Ideally, only strictly necessary penalization parameters (from the perspective of the desired 
analysis) should be introduced. Each additional penalty introduced will increase the difficulty of finding the 
optimal penalty values by increasing the dimension of the search-space. 

3.2. Selection of penalty parameters. As the £ 2 -penalty does not automatically induce sparsity in the 
estimate, it is natural to seek loss efficiency. We then use cross-validation (GV) for penalty parameter 
selection due to its relation to the minimization of the Kullback-Leibler divergence and its predictive accuracy 
stemming from its data-driven nature. Randomly divide the data of each class into k = 1,... ,K disjoint 


subsets of approximately the same size. 


Previously, we have defined fig = 


flg{A,{ftg,}g,^g) to be the 


precision matrix estimate of the gth class. Let fig*’ be the analogous estimate (with similar notational 
dependencies) for class g based on all samples not in k. Also, let S* denote the sample covariance matrix 
for class g based on the data in subset k and let denote the size of subset k in class g. The AT-fold CV 
score for our fused regularized precision estimate based on the fixed penalty A can then be given as: 

G K 




KG 


G K 


k=l 


1 


KG 


g=l k=l 
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One would then choose A* such that 


(13) A* = argminKCV(A), subject to: A > 0 A diag(A) > 0. 

A 


The least biased predictive accuracy can be obtained by choosing K = Ug such that Ug = 1. This would give 
the fused version of leave-one-out CV (LOOCV). Unfortunately, LOOCV is computationally demanding for 
large p and/or large Ug. We propose to select the penalties by the computationally expensive LOOCV only 
if adequate computational power is available. In cases where it is not, we propose two alternatives. 

Our first alternative is a special version of the LOOCV scheme that significantly reduces the computational 
cost. The special LOOCV (SLOOCV) is computed much like the LOOCV. However, only the class estimate 
in the class of the omitted datum is updated. More specifically, the SLOOCV problem is given by: 


(14) 


= arg min SLOOCV(A), 
A 


subject to: A > 0 A diag(A) > 0, 


with 


SLOOCV(A) = - — 

'^9 1 . T 

9=1 *=1 


g\ 9 


The estimate fig® in (14) is obtained by updating only tig using Proposition For all other g' ^ g, 
rjg,® = tig. The motivation for the SLOOCV is that a single observation in a given class g does not exert 
heavy direct influence on the estimates in the other classes. This way the number of fused ridge estimations 
for each given A and each given leave-one-out sample is reduced from n, to G estimations. Our second 
and fastest alternative is an approximation of the fused LOOCV score. This approximation can be used 
as an alternative to (S)LOOCV when the class sample sizes are relatively large (precisely the scenario 
where LOOCV is unfeasible). See Section 3 of the Supplementary Material for detailed information on this 
approximation. 


3.3. Choice of target matrices. The target matrices {Tg} can be used to encode prior information and 
their choice is highly dependent on the application at hand. As they influence the efficacy as well as the 
amount of bias of the estimate, it is of some importance to make a well-informed choice. Here, we describe 
several options of increasing level of informativeness. 

In the non-fused setting, the consideration of a scalar target matrix T = alp for some a S [0,oo) leads 
to a computational benefit stemming from the property of rotation equivariance |42j : Under such targets 
the ridge estimator only operates on the eigenvalues of the sample covariance matrix. This benefit transfers 
to the fused setting for the estimator described in Proposition]^ Hence, one may consider Tg = aglp with 
Og G [0, oo) for each g. The limited fused ridge problem in Price et al. [33] corresponds to choosing Og = 0 
for all g, such that a common target Tg = T = 0 is employed. This can be considered the least informative 
target possible. We generally argue against the use of the non p.d. target T = 0, as it implies shrinking the 
class precision matrices towards the null matrix and thus towards infinite variance. 

Choosing ag to be strictly positive implies a (slightly) more informative choice. The rotation equivariance 
property dictates that it is sensible to choose ag based on empirical information regarding the eigenvalues 
of Sg. One such choice could be the average of the reciprocals of the non-zero eigenvalues of Sg. A straight¬ 
forward alternative would be to choose Og = [tr(Sg)/p]“^. In the special case of (|^ where all ag = a the 
analogous choice would be a = [tr(S,)/p]“^. 

More informative targets would move beyond the scalar matrix. An example would be the consideration 
of factor-specific targets for factorial designs. Recalling Example]^ one might deem the data set factor to 
be a ‘nuisance factor’. Hence, one might choose different targets Tgcb and Tabc based on training data 
or the pooled estimates of the GCB and ABC samples, respectively. In general, the usage of pilot training 
data or (pathway) database information (or both) allows for the construction of target matrices with higher 
specificity. We illustrate how to construct targets from database information in the DLBCL application of 
Section 



TARGETED FUSED RIDGE ESTIMATION OF INVERSE COVARIANCE MATRICES 


9 


4. Fused graphical modeling 


4.1. To fuse or not to fuse. As a preliminary step to downstream modeling one might consider testing 
the hypothesis of no class heterogeneity—and therefore the necessity of fusing—amongst the class-specific 
precision matrices. Effectively, one then wishes to test the null-hypothesis iJo '■ = ■ ■ ■ — fie- Under Hq 

an explicit estimator is available in which the fused penalty parameters play no role, cf. Section 2.2 of the 
Supplementary Material. Here we suggest a score test [5] for the evaluation of Hq in conjunction with a way 
to generate its null distribution in order to assess its observational extremity. 

A score test is convenient as it only requires estimation under the null hypothesis, allowing us to exploit 
the availability of an explicit estimator. The score statistic equals: 

1 V dUg y I dflgOnj / OUg 

S=1 "''V 9 g / y - 


where denotes the precision estimate under Hq given in (S4), which holds for all classes g. The gradient 
can be considered in vectorized form and is readily available from (25). The Hessian of the log-likelihood 
equals d'^C/{drLgdrL^) = —® ^g^- For practical purposes of evaluating the score statistic, we employ 
the identity (A^ (giB) vec(C) = vec(BCA) which avoids the manipulation of {p^ x p^)-dimensional matrices. 
Hence, the test statistic U is computed by 


G G 

U = Y^ vec(Xg)^ vec(fj^‘>Xgi7^“) = ^ tr[Xg(0^«XgS7"«)], 

S=1 3=1 


where Xg = ng{2[{n^°)-^ - Sg] - - Sg] o 1^}. 

The null distribution of U can be generated by permutation of the class labels: one permutes the class 
labels, followed by re-estimation of under Hq and the re-calculation of the test statistic. The observed 
test statistic (under Hq) U is obtained from the non-permuted class labels and the regular fused estimator. 
The p-value is readily obtained by comparing the observed test statistic U to the null distribution obtained 
from the test statistic under permuted class labels. We note that the test is conditional on the choice of Agg. 


4.2. Graphical modeling. A contemporary use for precision matrices is found in the reconstruction and 
analysis of networks through graphical modeling. Graphical models merge probability distributions of ran¬ 
dom vectors with graphs that express the conditional (in)dependencies between the constituent random 
variables. In the fusion setting one might think that the class precisions share a (partly) common origin 
(conditional independence graph) to which fusion appeals. We focus on class-specific graphs Qg = {V,Eg) 
with a finite set of vertices (or nodes) V and set of edges Eg. The vertices correspond to a collection of 
random variables and we consider the same set V = {Yi,... ,Yp} of cardinality p for all classes g. That is, 
we consider the same p variables in all G classes. The edge set Eg is a collection of pairs of distinct vertices 
{Yj,Yj') that are connected by an undirected edge and this collection may differ between classes. In case we 
assume {Ti,..., Yp} ~ A/'p(0, Sg) for all classes g we are considering multiple Gaussian graphical models. 

Gonditional independence between a pair of variables in the Gaussian graphical model corresponds to zero 
entries in the (class-specific) precision matrix. Let fig denote a generic estimate of the precision matrix in 
class g. Then the following relations hold for all pairs {Yj,Yji} G V with j ^ f: 

(f8g)gy = =0 ^ Yg± Yg, \ V \ {Yg , Yg , } ill CkSS ^ {Yg , Yg, ) ^ Eg. 

Hence, determining the (in)dependence structure of the variables for class g —or equivalently the edge set 
Eg of Qg —amounts to determining the support of Clg. 


4.3. Edge selection. We stress that support determination may be skipped entirely as the estimated preci¬ 
sion matrices can be interpreted as complete (weighted) graphs. For more sparse graphical representations we 
resort to support determination by a local false discovery rate (IFDR) procedure [17] proposed by Schafer and 
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Strimmer [39| . This procedure assumes that the nonredundant off-diagonal entries of the partial correlation 
matrix 


(P.)i 


30 


' \^30 ^3'3'J 


1 

2 


follow a mixture distribution representing null and present edges. The null-distribution is known to be a 
scaled beta-distribution which allows for estimating the IFDR: 


(Pg),- 


which gives the empirical posterior probability that the edge between Yj and Yj/ is null in class g conditional 
on the observed corresponding partial correlation. The analogous probability that an edge is present can 
be obtained by considering 1 — IFDR^y. See [I71IM1I12] for further details on the IFDR procedure. Our 

strategy will be to select for each class only those edges for which 1 — IFDR^®) surpasses a certain threshold 
(see Section]^. This two-step procedure of regularization followed by subsequent support determination has 
the advantage that it enables probabilistic statements about the inclusion (or exclusion) of edges. 


4.4. Common and differential (sub-)networks. After estimation and sparsification of the class precision 
matrices the identification of commonalities and differences between the graphical estimates are of natural 
interest. Here we consider some (summary) measures to aid such identifications. Assume in the following 
that multiple graphical models have been identified by the sparsified estimates , Qq and that the 

corresponding graphs are denoted by ..., ^g- 

An obvious method of comparison is by pairwise graph differences or intersections. We utilize the differ¬ 
ential network Gg^\g2 = (f^i Eg^ \ Eg^) between class gi and g2 to provide an overview of edges present in one 
class but not the other. The common network Gin2 = (Yi Ei n E2) is composed of the edges present in both 
graphs. We also define the edge-weighted total network of m < G graphs Gi, ■ ■ ■, Gm as the graph formed by 
the union Giu-’-um = (F, Ai U • • • U E^) where the weight Wjji of the edge ejj> is given by the cardinality 
of the set {g G m} : ejj> G Eg}. More simply, ^lu-.-um is determined by summing the adjacency 

matrices of Gi to Gm- Analogously, the signed edge-weighted total network takes into account the stability of 
the sign of an edge over the classes by summing signed adjacency matrices. Naturally, the classes can also 
be compared by one or more summary statistics at node-, edge-, and network-level per class [cf. |29]. 

We also propose the idea of ‘network rewiring’. Suppose an investigator is interested in the specific 
interaction between genes A and B for classes gi and 52 - The desire is to characterize the dependency 
between genes A and B and determine the differences between the two classes. To do so, we suggest using 
the decomposition of the covariance of A and B into the individual contributions of all paths between A 
and B. A path z between A and B of length tz in a graph for class g is, following Lauritzen [24], defined 
to be a sequence A = vq, ... = B oi distinct vertices such that {vd-i,Vd) G Eg for all d = 1,...,The 

possibility of the mentioned decomposition was shown by Jones and West [52| and, in terms of = [ojjj'], 
can be stated as: 


(15) 


Cov(A, B) = E ( 1 ) ^Avi^ViV2^V2V2 

zGZab 








where Zab is the set of all paths between A and B and (Og)^p denotes the matrix with rows and columns 
corresponding to the vertices of the path z removed. Each term of the covariance decomposition in (15) can 


be interpreted as the flow of information through a given path z between A and B vaGg- Imagine performing 
this decomposition for A and B in both 17°^ and . For each path, we can then identify whether it runs 
through the common network Gging 2 ^ or utilizes the differential networks I/gaVsi>^si\s 2 unique to the classes. 
The paths that pass through the differential networks can be thought of as a ‘rewiring’ between the groups 
(in particular compared to the common network). In summary, the covariance between a node pair can be 
separated into a component that is common and a component that is differential (or rewired). 
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Example 4 . Suppose we have the following two graphs for classes 51 = 1 and 32 = 2: 



and consider the covariance between node A and B. In Qi the covariance Cov(Ya,Yb) is decomposed into 
contributions by the paths {A,B), {A,5,B), and (A, 5,4, B). Similarly for Q 2 , the contributions are from 
paths {A,5,B) and (^, 5,4,3, B). Thus {A,5,B) is the only shared path. Depending on the size of the 
contributions we might conclude that network 1 has some ‘rewired pathways’ compared to the other. This 
method gives a concise overview of the estimated interactions between two given genes, which genes mediate 
or moderate these interactions, as well as how the interaction patterns differ across the classes. In turn this 
might suggest candidate genes for perturbation or knock-down experiments. I 

5. Simulation study 

In this section we explore and measure the performance of the fused estimator and its behavior in four 
different scenarios. Performance is measured primarily by the squared Frobenius loss, 

L^f{ng{A),ng) = \\ngiA)~n,\\l, 

between the class precision estimate and the true population class precision matrix. However, the perfor¬ 
mance is also assessed in terms of the quadratic loss, 

T[f)(n,(A),n,) = ||n,(A)0;i-I,||^. 

The risk defined as the expected loss associated with an estimator, say, 

7^f{^^g(A)} =e[ 4 ®^ ( 05 (A), f 2 g)], 

is robustly approximated by the median loss over a repeated number of simulations and corresponding 
estimations. 

We designed four simulation scenarios to explore the properties and performance of the fused ridge esti¬ 
mator and alternatives. Scenario (1) evaluates the fused ridge estimator under two choices of the penalty 
matrix, the non-fused ridge estimate applied individually to the classes, and the non-fused ridge estimate 
using the pooled covariance matrix when (la) Jli = S },2 and (lb) Hi ^ H 2 . Scenario (2) evaluates the 
fused ridge estimator under different choices of targets: (2a) Ti = T 2 = 0, (2b) Ti = T 2 = alp, and 
(2c) Ti = T 2 = H. Scenario (3) evaluates the fused ridge estimator for varying network topologies and 
degrees of class homogeneity. Specifically, for (3a) scale-free topology and (3b) small-world topology, each 
with (3i) low class homogeneity and (3ii) high class homogeneity. Scenario (4) investigates the fused esti¬ 
mator under non-equal class sample sizes. Except for scenario 4, we make no distinction between the loss in 
different classes. Except for scenario 1, we use penalty matrices of the form A = Alp -I- A/(Jp — Ip). 

5.1. Scenario 1: Fusion versus no fusion. Scenario 1 explores the loss-efficiency of the fused estimate 
versus non-fused estimates as a function of the class sample size Ug for fixed p and hence for different p/n, 
ratios. Banded population precision matrices are simulated from G = 2 classes. We set p = 30 and 

( 16 ) {^g)jj' = - f\ ^ 

with k non-zero off-diagonal bands. The sub-scenario (la) Hi = H 2 uses fc = 15 bands whereas (lb) Hi yf H 2 
uses fc = 15 bands for Hi and k = 2 bands for H 2 . Hence, identical and very different population precision 
matrices are considered, respectively. 

Eor Ug = 10, 25, 70 the loss over 20 repeated runs was computed. In each run, the optimal unrestricted 
penalty matrix A was determined by LOOCV. The losses were computed for (li) the fused ridge estimator 
with an unrestricted penalty matrix, (Hi) the fused ridge estimator with a restricted penalty matrix such 
that All = A 22 , (liii) the regular non-fused ridge estimator applied separately to each class, and (liv) the 
regular non-fused ridge estimator using the pooled estimate S,. In all cases the targets Ti = T 2 = a.Ip 



12 


A.E. BILGRAU, C.F.W. PEETERS, P.S. ERIKSEN, M. B0GSTED, AND W.N. VAN WIERINGEN 


Scenario 1: G = 2, Ti = T 2 , p = 30 

Estimator |^|Fused (unrestricted)j^jFused (restricted) ^ Non-fused ^ Non-iused-pooled 


Scenario 2: G = 2, =^ 2 J^ =^ 2 , p = 50 


Target 


|Ta = 0[^T„ = al[±qTa = na 


n, =$72 



Frobenius loss 

■ +. ± 

^ ^ ^ 

Quadratic loss 


$7^ 7^ $72 

+ t 

$7, =$72 

j _ 1 _ 1 _ 1 

- 

"-ID 

0 

c 

s. 

* * ^ 

4^ 


Figure 1. A: Results for Scenario 1. The losses against the class samples size for different 
ridge estimators under unequal and equal class population matrices. B: Results for Scenario 
2. The losses against the class sample size with different target matrices. 


were used with a, = p/tr(S,). The risk and quartile losses for scenario 1 are seen in the boxplots of Figure 

[IK- 

Generally, the unrestricted fused estimates are found to perform at least as well as the (superior of the) 
non-fused estimates. This can be expected as the fused ridge estimate might be regarded as an interpo¬ 
lation between using the non-fused ridge estimator on the pooled data and within each class separately. 
Hence, the LOOCV procedure is thus able to capture and select the appropriate penalties both when the 
underlying population matrices are very similar and when they are very dissimilar. In the case of differing 
class population precision matrices, the restricted fused ridge estimator (that uses the single ridge penalty 
All = A 22 ) performs somewhat intermediately, indicating again the added value of the flexible penalty setup. 
It is unsurprising that the non-fused estimate using the pooled covariance matrix is superior in scenario lb, 
where Sli = $ 72 , as it is the explicit estimator in this scenario, cf. Section 2.2 of the Supplementary Material. 

5.2. Scenario 2: Target versus no target. Scenario 2 investigates the added value of the targeted 
approach to fused precision matrix estimation compared to that of setting Tg = 0 which reduces to the 
special-case considered by Price et al. [33]. We simulated data sets with G = 2 classes from banded precision 
matrices (as given in ( |16| )) with p = 50 variables and k = 25 bands for varying class sample sizes Ug and 
target matrices Ti and T 2 . The performance was evaluated using (2a) Ti = T 2 = 0, (2b) Tg = a,Ip, as 
above, and (2c) the spot-on target Ti = T 2 = for each of Ug = 25, 50,125 class sample sizes. 

As above, risks were estimated by the losses for each class over 20 simulation repetitions. The optimal 
penalties where determined by LOOCV with penalty matrices of the form A = Alp -I- A/(Jp — Ip). The 
results are shown in the boxplots in Panel B of Figure [l] As expected, the spot-on target shows superior 
performance in terms of loss in all cases. This suggests that well-informed choices of the target can greatly 
improve the estimation and that the algorithm will put emphasis on the target if it reflects the truth. Such 
behavior is also seen analytically in the ridge estimator of Schafer and Strimmer |39j inferred from their 
closed expression of the optimal penalty. We see that using the scalar target a. Ip resuts in an as-good or 
lower risk in terms of the quadratic but not the Frobenius loss compared to the no-target situation. 
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As this scenario corresponds to the case of Price et al. |33j we performed a secondary timing benchmark 
of their accompanying RidgeFusion package compared to rags2ridges. We evaluated estimation time of 
each package on a single simulated data set with p = 50, G = 2, and ni = n 2 = 10 using a banded matrix 
as before. The average estimation times over 100 model fits where 8.9 and 26 milliseconds for packages 
rags2ridges and RidgeFusion, respectively. This approximates a factor 2.94 speed-up for a single model 
fit. The timing was done using the package microbenchmark |28j and the estimates from each package were 
in agreement within expected numerical precision. 

5.3. Scenario 3: Varying topology and class (dis)similarity. Scenario 3 investigates the fused estima¬ 
tor with G = 3 classes for (3i) high and (3ii) low class homogeneity and two different latent random graph 
topologies on p = 50 variables. The topologies are the (3a) ‘small-world’ and the (3b) ‘scale-free’ topology 
generated by Watts-Strogatz and Barabasi graph games, respectively. The former generates topologies where 
all node degrees are similar while the latter game generates networks with (few) highly connected hubs. From 
the generated topology, we construct a latent precision matrix with diagonal elements set to 1 and the 
non-zero off-diagonal entries dictated by the network topology set to 0 . 1 . 

The two topologies are motivated as they imitate many real phenomena and processes. Small-world 
topologies approximate systems such as power grids, the neural network of the worm C. elegans, and the 
social networks of film actors Ezmi]. Conversely, scale-free topologies approximate many social networks, 
protein-protein interaction networks, airline networks, the world wide web, and the internet HIS]. 

We control the inter-class homogeneity using a latent inverse Wishart distribution for each class covariance 
matrix as considered by Bilgrau et al. [9]. That is, we let 

(17) S 5 = f^;i~Wp-i((u-p-l)$-i,u), u>p+l 

where yV’~^(0, u) denotes an inverse Wishart distribution with scale matrix © and u degrees of freedom. The 
parametrization implies the expected value E[Sg] = £[ 11 '^] = and thus $ defines the latent expected 
topology. We simulate from a multivariate normal distribution as before conditional on the realized covariance 
Sg. In ( [l7| , the parameter v controls the inter-class homogeneity. Large v imply that FJi « ^2 ~ FJa and 
thus a large class homogeneity. Small values of u —>■ (p -F I)"*" imply large heterogeneity. 

For the simulations, we chose (i) u = 100 and (ii) u = 1000. Again we fitted the model using both the zero 
target as well as the scalar matrix target described above using the reciprocal value of the mean eigenvalue, 
i.e., Ti = T 2 = T 3 = alp for both a = 0 and a = p/tr(S,). The estimation was repeated 20 times 
for each combination of high/low class similarity, network topology, choice of target, and class sample-size 
m = n 2 = ns = 25,50,125. Panels A and B of Figure show box-plots of the results. 

First, the loss is seen to be dependent on the network topology, irrespective of the loss function. Second, 
as expected, the loss is strongly influenced by the degree of class (dis)similarity where a higher homogeneity 
yields a lower loss. Intuitively, this makes sense as the estimator can borrow strength across the classes 
and effectively increase the degrees of freedom in each class. Third, the targeted approach has a superior 
loss in all cases with a high class homogeneity and thus the gain in loss-efficiency is greater for the targeted 
approach. For low class homogeneity, the targeted approach performs comparatively to the zero target with 
respect to the Frobenius loss while it is seemingly better in terms of quadratic loss. Measured by quadratic 
loss, the targeted approach nearly always outperforms the zero target. 

5.4. Scenario 4: Unequal class sizes. Scenario 4 explores the fused estimator under unequal class sample 
sizes. We simulated data with k = 8 non-zero off-diagonal bands, G = 2, and p = 50. The number of samples 
in class 2 was fixed at 712 = 30 while the number of samples in class 1 were varied: ni = 5, 25, 50, 75. The 
results of the simulation are shown in Panel C of Figurej^ Note that we consider the Frobenius and quadratic 
loss within each class separately here. 

Not surprisingly, the fused estimator performs better (for both classes) when n, increases. Perhaps more 
surprising, there seems to be no substantial difference in quadratic loss for group ni and n 2 suggesting that 
the fusion indeed borrows strength from the larger class. A loss difference is only visible in the most extreme 
case where ni = 5 and 712 = 30. The relative difference however is not considered large. 
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Figure 2. A: Scenario 3. The Frobenius loss as a function of sample size under different 
topologies and degree of similarity. B; Scenario 3. The quadratic loss as a function of sample 
size under different topologies and degree of similarity. C: Scenario 4. The loss as a function 
of sample size of class 1 with fixed sample size for class 2. 

6. Applications 

Lymphoma refers to a group of cancers that originate in specific cells of the immune system such as white 
blood T- or B-cells. Approximately 90% of all lymphoma cases are non-Hodgkin’s lymphomas—a diverse 
group of blood cancers excluding Hodgkin’s disease—of which the aggressive diffuse large B-cell lymphomas 
(DLBCL) constitutes the largest subgroup [H]. We showcase the usage of the fused ridge estimator through 
two analyzes of DLBCL data. 

In DLBCL, there exists at least two major genetic subtypes of tumors named after their similarities in 
genetic expression with activated B-cells (ABC) and germinal centre B-cells (GCB). A third umbrella class, 
usually designated as Type HI, contains tumors that cannot be classified as being either of the ABC or GCB 
subtype. Patients with tumors of GCB class show a favorable clinical prognosis compared to that of ABC. 
Even though the genetic subtypes have been known for more than a decade [T] and despite the appearance 
of refinements to the DLBCL classification system m, DLBCL is still treated as a singular disease in 
daily clinical practice and the first differentiated treatment regimens have only recently started to appear in 
clinical trials [MIEQ]. Many known phenotypic differences between ABC and GCB are associative, which 
might underline the translational inertia. Hence, the biological underpinnings and functional differences 
between ABC and GCB are of central interest and the motivation for the analyzes below. 

Incorrect regulation of the NF-kB signaling pathway, responsible for i.a. control of cell survival, has 
been linked to cancer. This pathway has certain known drivers of deregulation. Aberrant interferon /3 
production due to recurrent oncogenic mutations in the central MYD88 gene interferes with cell cycle arrest 
and apoptosis m- It also well-known that BCL2, another member of the NF-kB pathway, is deregulated in 
DLBCL [?D]. Moreover, a deregulated NF-kB pathway is a key hallmark distinguishing the poor prognostic 
ABC subclass from the good prognostic GCB subclass of DLBCL [3^. Our illustrative analyzes thus focus on 
the functional differences between ABC and GCB in relation to the NF-kB pathway. Section [O] inv estigates 
the DLBCL classes in the context of a single data set on the NF-kB signalling pathway. Section |6)2] analyzes 
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multiple DLBCL NF-kB data sets with a focus on finding common motifs and motif differences in network 
representations of pathway-deregulation. These analyzes show the value of a fusion approach to integration. 
In all analyzes we take the NF-kB pathway and its constituent genes to be defined by the Kyoto Encyclopedia 
of Genes and Genomes (KEGG) database [23]. 


6.1. Nonintegrative analysis of DLBCL subclasses. We first analyze the data from Dybkasr et al. m, 
consisting of 89 DLBGL tumor samples. These samples were RMA-normalized using custom brainarray chip 
definition files (GDE) [T2] and the R-package affy |5D|. This preprocessing used Entrez gene identifiers (EID) 
by the National Genter for Biotechnology Information (NGBI), which are also used by KEGG. The usage 
of custom CDFs avoids the mapping problems between Affymetrix probeset IDs and KEGG. Moreover, the 
custom CDFs can increase the robustness and precision of the expression estimates [111137]. The RMA- 
preprocessing yielded 19,764 EIDs. Subsequently, the features were reduced to the available 82 out of the 
91 EIDs present in the KEGG NF-kB pathway. The samples were then partitioned, using the DLBCL 
automatic classifier (DAG) by Care et al. [IT], into the three classes ABC (ni = 31), III (^2 = 13), and GCB 
(na = 45), and gene-wise centered to have zero mean within each class. 

The analysis was performed with the following settings. Target matrices for the groups were chosen to 
be scalar matrices with the scalar determined by the inverse of the average eigenvalue of the corresponding 
sample class covariance matrix, i.e.: 


Tabc = ail 


■pi 


-III 


= a2lp, Tggb = Oizlp, where a„ = 


tr(Sg 


These targets translate to a class-scaled ‘prior’ of conditional independence for all genes in NF-kB. The 
optimal penalties were determined by LOOCV using the penalty matrix and graph given in (18). Note 
that the penalty setup bears resemblance to Example]^ Differing class-specific ridge penalties were allowed 
because of considerable differences in class sample size. Direct shrinkage between ABC and GCB was 
disabled by fixing the corresponding pair-fusion penalty to zero. The remaining fusion penalties were free to 
be estimated. Usage of the Nelder-Mead optimization procedure then resulted in the optimal values given 
on the right-hand side of (18) below: 


ABC Type III GCB 



All 

Ai2 

0 ■ 


2 

1.5 X 10-3 

0 

Ai2 

A 22 

A 23 

= 

1.5 X 10-3 

2.7 
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0 

A 23 

A33_ 


0 

2 X 10-3 

2.3 


ABC 

III 

GCB 


The ridge penalties of classes ABC and GCB are seen to be comparable in size. The small size of the 
Type III class leads to a relatively larger penalty to ensure a well-conditioned and stable estimate. The 
estimated fusion penalties are all relatively small, implying that heavy fusion is undesirable due to class- 
differences. The three class-specific precision matrices were estimated under A* and subsequently scaled to 
partial correlation matrices. Panels A-C of Figure [^visualize these partial correlation matrices. In general, 
the ABC and GCB classes seem to carry more signal in both the negative and positive range vis-a-vis the 
Type III class. 

Post-hoc support determination was carried out on the partial correlation matrices using the class-wise 
IFDR approach of Section 4.3 The IFDR threshold was chosen conservatively to 0.99, selecting 39, 85, 34 
edges for classes ABC, III, GCB, respectively. The relatively high number of edges selected for the Type III 
class is (at least partly) due to the difficulty of determining the mixture distribution mentioned in Section 
|4.3| when the overall partial correlation signal is relatively flat. Panels D-E of Figure [^ then show the 
conditional independence graphs corresponding to the sparsified partial correlation matrices. We note that 
a single connected component is identified in each class, suggesting, at least for the ABC and GCB classes, 
a genuine biological signal. A secondary supporting overview is provided in Table [l] 

Table [^ gives the most central genes in the graphs of Panels D-E by two measures of node centrality: 
degree and betweenness. The node degree indicates the number of edges incident upon a particular node. 
The betweenness centrality indicates in how many shortest paths between vertex pairs a particular node acts 
as an intermediate vertex. Both measures are proxies for the importance of a feature. See, e.g., |29] for an 
overview of these and other centrality measures. It is seen that the CCL, CXCL, and TNF gene families 
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Figure 3. Top: Heat maps and color key of the partial correlation matrices for the ABC 
(panel A), III (panel B), and GCB (panel C) classes in the NF-kB signaling pathway on the 
Dybkaer et al. |14j data. Bottom: Graphs corresponding to the sparsified precision matrices 
for the classes above. Red and blue edges correspond to positive and negative partial 
correlations, respectively. Far right-panel: FID key and corresponding Human Genome 
Organization (HUGO) Gene Nomenclature Committee (HGNC) curated gene names of the 
NF-kB signaling pathway genes. Genes that are connected in panels D-F are shown bold. 


are well-represented as central and connected nodes across all classes. The gene CCL21 is very central in 
classes ABC and HI, but less so in the GCB class. From Panels D-E of Figure it is seen that BCL2 and 
BCL2A1 are only connected in the non-ABC classes. Contrary to expectation, MYD88 is disconnected in all 
graphs. The genes ZAP70, LAT, and LCK found in Figurej^and Table[T]are well-known T-cell specific genes 
involved in the initial T-cell receptor-mediated activation of NF-kB in T-cells [7]. From the differences in 
connectivity of these genes, different abundances of activated T-cells or different NF-kB activation programs 
for ABC/GCB might be hypothesized. 

6.2. Integrative DLBCL analysis. We now expand the analysis of the previous section to show the 
advantages of integration by fusion. A large number of DLBCL gene expression profile (GEP) data sets 
is freely available at the NGBI Gene Expression Omnibus (GEO) website [ 3 . We obtained 11 large-scale 
DLBGL data sets whose GEO-accession numbers (based on various Affymetrix microarray platforms) can be 
found in the first column of Table One of the sets, with GEO-accession number GSE11318, is treated as 
a pilot/training data set for the construction of target matrices (see below). The GSE10846 set is composed 
of two distinct data sets corresponding to two treatment regimens (R-CHOP and GHOP) as well as different 
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Table 1. The most central genes, their EID, and their plot index. For each class and 
node, the degree (with the number of positive and negative edges connected to that node in 
parentheses) and the betweenness centrality is shown. Only the 15 genes with the highest 
degrees summed over each class are shown. 



EID 

Index 

ABC 


III 

GCB 


Degree 

Betw. 

Degree 

Betw. 

Degree 

Betw. 

CCL21 

6366 

77 

9(5+,4-) 

202.0 

17(9+, 8") 

297.00 

4(3+,l-) 

106 

CXCL8 

3576 

38 

5(2+,3-) 

126.0 

12 (4+,8") 

234.00 

4(l+.3-) 

56 

CCL19 

6363 

78 

4(4+,0-) 

120.0 

10 (6+,4-) 

91.70 

6 (6+,0-) 

230 

LTA 

4049 

80 

5(3+,2-) 

143.0 

10 (6+,4-) 

195.00 

3(3+,0-) 

56 

CXCL12 

6387 

40 

3(2+,l-) 

84.2 

12(5+, 7-) 

187.00 

2 (2+,0-) 

27 

CXCL2 

2920 

76 

3(3+,0-) 

61.0 

11(5+,6-) 

196.00 

3(2+,l-) 

53 

LTB 

4050 

81 

4(3+,l-) 

85.5 

5(3+,2-) 

4.24 

6(3+,3-) 

98 

CD14 

929 

51 

3(2+,l-) 

20.2 

6(3+,3-) 

25.90 

3(2+,l-) 

32 

CCL4 

6351 

74 

2 (1+,1-) 

5.0 

8(5+,3-) 

118.00 

2 (1+,1-) 

4 

ZAP70 

7535 

48 

3(2+,l-) 

60.0 

5(4+,l-) 

50.70 

3(2+,l-) 

75 

CCL13 

6357 

39 

4(3+,l-) 

119.0 

5(3+,2-) 

19.70 

i(i+,o-) 

0 

TNFSFll 

8600 

42 

5(4+,l-) 

160.0 

2 (1+,1-) 

0.00 

3(2+,l-) 

55 

TNF 

7124 

16 

i(i+,o-) 

0.0 

4(2+,2-) 

1.68 

3(3+,0-) 

24 

LAT 

27040 

49 

2 (2+,0-) 

0.0 

4(4+,0-) 

15.80 

2 (2+,0-) 

0 

LCK 

3932 

62 

2 (0+,2-) 

31.0 

3(3+,0-) 

10.00 

3(2+,l-) 

64 


Table 2. Overview of data sets, the defined classes, and the number of samples. In 
GSE31312, 28 samples were not classified with the DAC due to technical issues and hence do 
not appear in this table. In the pilot study GSE11318, 31 samples were primary mediastinal 
B-cell lymphoma and left out. Note also that the pilot data set GSE11318 was not classified 
by the DAG. 



ABC 

Type III 

GBC 

J2n-g 

g 

Ug 

9 

Ug 

9 

Ug 

Pilot data 

GSE11318 


74 


71 


27 

172 

Data set 








GSE56315 

1 

31 

2 

13 

3 

45 

89 

GSE19246 

4 

51 

5 

30 

6 

96 

177 

GSE12195 

7 

40 

8 

18 

9 

78 

136 

GSE22895 

10 

31 

11 

21 

12 

49 

101 

GSE31312 

13 

146 

14 

97 

15 

224 

467 

GSE10846.GHOP 

16 

64 

17 

28 

18 

89 

181 

GSE10846.RCHOP 

19 

75 

20 

42 

21 

116 

233 

GSE34171.hgul33plus2 

22 

23 

23 

15 

24 

52 

90 

GSE34171.hgul33AplusB 

25 

18 

26 

17 

27 

43 

78 

GSE22470 

28 

86 

29 

43 

30 

142 

271 

GSE4475 

31 

73 

32 

20 

33 

128 

221 



638 


344 


1062 

2044 


time-periods of study. Likewise, GSE34171 is composed of three data sets corresponding to the respective 
microarray platforms used: HG-U133A, HG-U133B, and HG-U133 plus 2.0. As the samples on HG-U133A 
and HG-U133B were paired and run on both platforms, the (overlapping) features were averaged to form a 
single virtual microarray comparable to that of HG-U133 plus 2.0. Note that the Dybk®r et al. m data 
used in Section [O] is part of the total batch under GEO-accession number GSE56315. The sample sizes for 
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the individual data sets vary in the range 78-495 and can also be found in Table The data yield a total 
of 2,276 samples making this, to our knowledge, the hitherto largest integrative DLBCL study. 

Similar to above, all data sets were RMA-normalized using custom brainarray CDFs and the R-package 
affy. Again, NCBI EIDs were used to avoid non-bijective gene-ID translations between the array-platforms 
and the KEGG database. The freely available R-package DLBCLdata was created to automate the download 
and preprocessing of the data sets in a reproducible and convenient manner. See the DLBCLdata documen¬ 
tation [5] for more information. Subsequently, the data sets were reduced to the intersecting 11,908 EIDs 
present on all platforms. All samples in all data sets, except for the pilot study GSE11318, were classified as 
either ABG, GCB, or Type 111 using the DAG mentioned above. The same classifier was used in all data sets 
to obtain a uniform classification scheme and thus maximize the comparability of the classes across data sets. 
Subsequently, the features were reduced to the EIDs present in the NF-kB pathway and gene-wise centered 
to have zero mean within each combination of DLBGL subtype and data set. We thus have a a two-way 
study design—DLBGL subtypes and multiple data sets—analogous to Example A concise overview of 
each of the 11 x 3 = 33 classes for the non-pilot data is provided in Table 

The target matrices were constructed from the pilot data in an attempt to use information in the directed 
representation of the NF-kB pathway obtained from KEGG. The directed graph represents direct and 
indirect causal interactions between the constituent genes. It was obtained from the KEGG database via 
the R-package KEGGgraph [SO]. A target matrix was constructed for each DLGBL subtype using the pilot 
data and the information from the directed topology by computing node contributions using multiple linear 
regression models. That is, from an initial T = 0, we update T for each node a G E(0pw) through the 
following sequence: 


T„ 


^ pa(Q) ,q; 
^a,pa(Q;) 
- pa(a),pa(a) 


— T -I- d- 

— a^a W 2 


Tpa(Q)^a -|- g.2/3pa(Q) 

= TQ,_pa(Q,) -|- y2^pa(a) 

= Tpa(a),pa(o!) + y2/3pa(a)/3pa(a)! 


where pa(a) denotes the parents of node a in Gpw, and where a and (3 are the residual standard error and 
regression coefficients obtained from the linear regression of a on pa(Q:). By this scheme the target matrix 
represents the conditional independence structure that would result from moralizing the directed graph. If 
^pw is acyclic then T 0 is guaranteed. 

The penalty setup bears resemblance to Example The Type III class is considered closer to the ABG 
and GCB subtypes than ABC is to GCB. Thus, the direct shrinkage between the ABC and GCB subtypes 
was fixed to zero. Likewise, direct shrinkage between subtype and data set combinations was also disabled. 
Hence, a common ridge penalty A, a data set-data set shrinkage parameter Ads and a subtype-subtype 
shrinkage parameter Ast were estimated. The optimal penalties were determined by SLOOCV using the 


penalty matrix and graph given in (191 below: 

Type III 


ABC 


GCB 


(19) 



A = 


- A 

Ast 

0 

Ads 

0 

0 • 

• Ads 

0 

0 1 

Ast 

A 

Ast 

0 

Ads 

0 ■ 

• 0 

Ads 

0 

0 

Ast 

A 

0 

0 

Ads • 

• 0 

0 

Ads 

Ads 

0 

0 

A 

Ast 

0 • 

• Ads 

0 

0 

0 

Ads 

0 

Ast 

A 

Ast • 

• 0 

Ads 

0 

0 

0 

Ads 

0 

Ast 

A • 

■ 0 

0 

Ads 

Ads 

0 

0 

Ads 

0 

0 ■ 

■ A 

Ast 

0 

0 

Ads 

0 

0 

Ads 

0 ■ 

• Ast 

A 

Ast 

- 0 

0 

Ads 

0 

0 

Ads • 

• 0 

Ast 

A J 


The optimal penalties were found to be A* = 2.2 for the ridge penalty, A^g = 0.0022 for the data set fusion 
penalty, and Ag^p = 6.8 x 10“^ for the subtype fusion penalty, respectively. 
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Figure 4. Summary of the estimated precision matrices for the NF-kB pathway. Top 
row. Heat maps of the estimated precision matrices pooled across data sets for each genetic 
subtype. Middle row from left to right: The pooled target matrix for ABC, the difference 
between the pooled ABC and GCB estimates, and the pooled target matrix for GCB. 
Bottom: The color key for the heat maps. 


To summarize and visualize the 33 class precision estimates they were pooled within DLBCL subtype. 
Panels A-C of Figure visualizes the 3 pooled estimates as heat maps. Panels D and F visualize the 
constructed target matrices for the ABC and GCB subtypes, respectively. Panel E then gives the difference 
between the pooled ABC and GCB estimates, indicating that they harbor differential signals to some degree. 
We would like to capture the commonalities and differences with a differential network representation. 

The estimated class-specific precision matrices were subsequently scaled to partial correlation matrices. 
Each precision matrix was then sparsified using the lEDR procedure of Section |4.3[ Civen the class an edge 
was selected whenever 1 — lEDR > 0.999. To compactly visualize the the multiple GGMs we obtained signed 
edge-weighted total networks mentioned in Section [4.4[ Clearly, for inconsistent connections the weight would 
vary around zero, while edges that are consistently selected as positive (negative) will have a large positive 
(negative) weight. These meta-graphs are plotted in Figure Panels A-C give the signed edge-weighted 
total networks for each subtype across the data sets. They show that (within DLBCL subtypes) there are a 
number of edges that are highly concordant across all data sets. To evaluate the greatest differences between 
the ABC and GCB subtypes, the signed edge-weighted total network of the latter was subtracted from the 
former. The resulting graph ^abc-GCB can be found in Panel D. Edges that are more stably present in 
the ABC subtype are represented in orange and the edges more stably present in the GCB subtype are 
represented in blue. Panel E represents the graph from panel D with only those edges retained whose 
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absolute weight exceeds 2. In a sense, the graph of panel F then represents the stable differential network. 
The strongest connections here should suggest places of regulatory deregulation gained or lost across the 
two subtypes. Interestingly, this differential network summary shows relatively large connected subgraphs 
suggesting differing regulatory mechanisms. 

The graph in panel F of Figurej^then conveys the added value of the integrative fusion approach. Certain 
members of the CCL, CXCL, and TNF gene families who were highly central in the analysis of Section [6T] 
are still considered to be central here. However, it is also seen that certain genes that garnered high centrality 
measures in the single data set analyzed in Section 6.1 do not behave stably across data sets, such as CXCL2. 
In addition, the integrative analysis appoints the BCL2 gene family a central role, especially in relation to the 
ABC subtype. This contrasts with Section [O] where the BCL2 gene family was not considered central and 
appeared to be connected mostly in the non-ABC classes. Moreover, whereas the analysis of the single data 
set could not identify a signal for MYD88, the integrative analysis identifies MYD88 to be stably connected 
across data sets. Especially the latter two observations are in line with current knowledge on deregulation 
in the NF-kB pathway in DLBCL patients. Also in accordance with litterature is the known interaction of 
LTA with LTB seen in panel F of Figure]^ [451HH] which here appear to be differential between ABC/GCB. 
Thus, borrowing information across classes enables a meta-analytic approach that can uncover information 
otherwise unobtainable through the analysis of single data sets. 


7. Discussion and conclusion 

We considered the problem of jointly estimating multiple inverse covariance matrices from high-dimen¬ 
sional data consisting of distinct classes. A fused ridge estimator was proposed that generalizes previous 
contributions in two principal directions. First, we introduced the use of targets in fused ridge precision 
estimation. The targeted approach helps to stabilize the estimation procedure and allows for the incor¬ 
poration of prior knowledge. It also juxtaposes itself with various alternative penalized precision matrix 
estimators that pull the estimates towards the edge of the parameter space, i.e., who shrink towards the 
non-interpretable null matrix. Second, instead of using a single ridge penalty and a single fusion penalty 
parameter for all classes, the approach grants the use of class-specific ridge penalties and class-pair-specific 
fusion penalties. This results in a flexible shrinkage framework that (i) allows for class-specific tuning, that 
(ii) supports analyzes when a factorial design underlies the available classes, and that (iii) supports the ap¬ 
propriate handling of situations where some classes are high-dimensional whilst others are low-dimensional. 
Targeted shrinkage and usage of a flexible penalty matrix might also benefit other procedures for precision 
matrix estimation such as the fused graphical lasso m- 

The targeted fused ridge estimator was combined with post-hoc support determination, which serves as a 
basis for integrative or meta-analytic Gaussian graphical modeling. This combination thus has applications 
in meta-, integrative-, and differential network analysis of multiple data sets or classes of data. This meta¬ 
approach to network analysis has multiple motivations. First, by combining data it can effectively increase 
the sample size in settings where samples are relatively scarce or expensive to produce. In a sense it refocuses 
the otherwise declining attention to obtaining a sufficient amount of data—a tendency we perceive to be 
untenable. Second, aggregation across multiple data sets decreases the likelihood of capturing idiosyncratic 
features (of individual data sets), thereby preventing over-fitting of the data. 

Insightful summarization of the results is important for the feasibility of our approach to fused graphical 
modeling. To this end we have proposed various basic tools to summarize commonalities and differences over 
multiple graphs. These tools were subsequently used in a differential network analysis of the NF-/cB signaling 
pathway in DLBCL subtypes over multiple GEP data sets. This application is not without critique, as it 
experiences a problem present in many GEP studies: The classification of the DLBGL subtypes (ABC and 
GBC) is performed on the basis of the same GEP data on which the network analysis is executed. This may 
be deemed methodologically undesirable. However, we justify this double use of data as (a) the pathway of 
interest involves a selection of genes whereas the classification uses all genes, and (b) the analysis investigates 
partial correlations and differential networks whereas the classihcation, in a sense, considers only differential 
expression. Furthermore, as in all large-scale genetic screenings, the analyzes should be considered ‘tentative’ 
and findings need to be validated in independent experiments. Notwithstanding, the analyzes show that 
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Index Entrez HUGO 


148022 TICAM1 


Figure 5. Summary of estimated GGMs for the NF-kB pathway. Panels A-C: Graphs 
obtained by adding the signed adjacency matrices for each subtype across the data sets. The 
edge widths are drawn proportional to the absolute edge weight. Panel D: Graph obtained 
by subtracting the summarized signed adjacency matrix of GGB (panel A) from that of 
ABC (panel C). Edge widths are drawn proportional to the absolute weight and colored 
according to the sign. Orange implies edges more present in ABC and blue implies edges 
more present in GGB. Panel E: As the graph in panel D, however only edges with absolute 
weight > 2 are drawn. Panel F: As the graph in panel E, but with an alternative layout. Far 
right-panel: FID key and corresponding HGNC curated gene names of the NF-kB pathway 
genes. Genes that are connected in panel F are shown bold. 

the fusion approach to network integration has merit in uncovering class-specific information on pathway 
deregulation. Moreover, they exemplify the exploratory hypothesis generating thrust of the framework we 
offer. 

We see various inroad for further research. With regard to estimation one could think of extending the 
framework to incorporate a fused version of the elastic net. Mixed fusion, in the sense that one could do 
graphical lasso estimation with ridge fusion or ridge estimation with lasso fusion, might also be of interest. 
From an applied perspective the desire is to expand the toolbox for insightful (visual) summarization of 
commonalities and differences over multiple graphs. Moreover, it is of interest to explore improved ways 
for support determination. The IFDR procedure, for example, could be expanded by considering all classes 
jointly. Instead of applying the IFDR procedure to each class-specific precision matrix, one would then be 
interested in determining the proper mixture of a grand common null-distribution and multiple class-specific 
non-null distributions. These inroads were out of the scope of current work, but we hope to explore them 
elsewhere. 

7.1. Software implementation. The fused ridge estimator and its accompanying estimation procedure 
is implemented in the rags2ridges-package |,11j for the statistical language R. This package has many 
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supporting functions for penalty parameter selection, graphical modeling, as well as network analysis. We 
will report on its full functionality elsewhere. The package is freely available from the Comprehensive R 
Archive Network: http: //cran. r-proj ect. org/ 
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Appendix A. Geometric interpretation of the fused ridge penalty 


Some intuition behind the fused ridge is provided by pointing to the equivalence of penalized and con¬ 
strained optimization. To build this intuition we study the geometric interpretation of the fused ridge penalty 
in the special case of Q with T = 0. In this case \gg = A for all g, and Xg^g^ — ^.ll gi ^52- Clearly, 

the penalty matrix then amounts to A = Alp -I- A/(Jp — Ip). Matters are simplified further by considering 
G = 2 classes and by focusing on a specific entry in the precision matrix, say {flg)jji = , for 5 = 1,2. By 

doing so we ignore the contribution of other precision elements to the penalty. Now, the fused ridge penalty 
may be rewritten as: 


A 

2 



||2 

\\f 



31 = 1 92 = 1 


hi 

2 


Hi — H2 


||2 

IIf' 


Subsequently considering only the contribution of the entries implies this expression can be further 
reduced to: 


A 

2 



2 ^ JJ 33 ') 


X + Xf 



^f^33'^33'- 


It follows immediately that this penalty imposes constraints on the parameters ujjy and amounting to 
the set: 


( 20 ) 


wll' ’ ^ 33 ') 


A + A/ 
2 



33 ' 


XfU3F;U3 


< c 
33 ' — 


for some c G K+. It implies that the fused ridge penalty can be understood by the implied constraints on 
the parameters. Figure shows the boundary of the set for selected values. 

Panel reveals the effect of the fused, inter-class penalty parameter Xf (while keeping A fixed). At 
Xf = 0, the constraint coincides with the regular ridge penalty. As A/ increases, the ellipsoid shrinks along 
the minor principal axis x = —y with no shrinkage along x = y. In the limit A/ —> 00 the ellipsoid collapses 


onto the identity line. Hence, the parameters and are shrunken towards each other and while their 
differences vanish, their sum is not affected. Hence, the fused penalty parameter primarily shrinks the ‘sum 
of the parameters’, but also fuses them as a bound on their sizes implies a bound on their difference. 

Panelshows the effect of the intra-class A penalty (while keeping A/ fixed). When the penalty vanishes 
for A —> 0 the domain becomes a degenerated ellipse (i.e. cylindrical for more than 2 classes) and parameters 
ujjy and uj^y may assume any value as long as their difference is less than y/2cjXf. For any A > 0, the 
parameter-constraint is ellipsoidal. As A increases the ellipsoid is primarily shrunken along the principal 
axis formed by the identity line and along the orthogonal principal axis (jj = —x). In the limit A —>■ 00 the 
ellipsoid collapses onto the point (0,0). It is clear that the shape of the domain in (201 is only determined 
by the ratio of A and Xf. 
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Figure 6 . Visualization of the effects of the fused ridge penalty in terms of constraints. 
The left panel shows the effect of A/ for fixed A. Here, A/ = 0 is the regular ridge penalty. 
The right panel shows the effect of A while keeping A/ fixed. 


The effect of the penalties on the domain of the obtainable estimates can be further understood by noting 
that the fused ridge penalty @ can be rewritten as 

(21) A ^ - T,J + - TgJ11^ + A/ ^ \\{ng, - T,J - - TgJ || ^, 

91,92 91,92 


for some penalties A and A/. The details of this derivation can be found in Section A.l below. The first and 
second summand of the rewritten penalty (211 respectively shrink the sum and difference of the parameters 
of the precision matrices. Their contributions thus coincide with the principal axes along which two penalty 
parameters shrink the domain of the parameters. 


A.l. Alternative form for the fused ridge penalty. This section shows that the alternative form (21) 
for the ridge penalty can be written in the form (|^. We again assume a common ridge penalty \gg = A 
and a common fusion penalty Xg^^g^ = A/ for all classes and pairs thereof. To simplify the notation, let 
Now, 


Ag = Ug- Tg. 
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Hence, the alternative penalty (21) is also of the form @ and thus the fused ridge of ( pTj ) is equivalent to 
(|^ for appropriate choices of the penalties. 


Appendix B. Results and proofs 


Section B.l contains supporting results from other sources and results in support of Algorithmic Section 
B.2| contains proofs of the results stated in the main text as well as additional results conducive in those 
proofs. 


B.l. Supporting results. 

Lemma 1 (van Wieringen and Peeters [42]). Amend the log-likelihood 0 with the i 2 -penalty 

hn-Tf 

2 " 

with T S denoting a fixed symmetric p.s.d. target matrix, and where A G (0, oo) denotes a penalty 
parameter. The zero gradient equation w.r.t. the precision matrix then amounts to 

(22) - (S - AT) - AO = 0, 

whose solution gives a penalized ML ridge estimator of the precision matrix: 


(l{\) = 


AR 


i(S-AT)2 


1 1/2 


^(S-AT) 


Lemma 2 (van Wieringen and Peeters |12|). Consider J7(A) from Lemma^ and define [J^(A)] ^ = S(A). 
The following identity then holds: 

S - AT = S(A) - AO(A). 

Lemma 3. Let A G be a matrix of fixed penalty parameters such that A > 0. Moreover, let {Tg} G S^. 
Then if diag(A) > 0, the problem of (jC is strictly concave. 

Proof of Lemma\^ By diag(A) > 0, it is clear that the fused ridge penalty (j^) is strictly convex as it is 
a conical combination of strictly convex and convex functions. Hence, the negative fused ridge penalty is 
strictly concave. The log-likelihood of (j^) is a conical combination of concave functions and is thus also 
concave. Therefore, the penalized log-likelihood is strictly concave. □ 

B.2. Proofs and additional results. 


Proof of Proposition [I] To find the maximizing argument for a specific class of the general fused ridge pe¬ 
nalized log-likelihood problem 0 we must obtain its first-order derivative w.r.t. that class and solve the 
resulting zero gradient equation. To this end we first rewrite the ridge penalty (j^ into a second alternative 
form. Using that A = A^, and keeping in mind the cyclic property of the trace as well as properties of Clg 
and Tg stemming from their symmetry, we may find: 


/™"({f^g};A,{Tg}) 


^ ^ ^gg I 


^g 


'^9192 


(rjgi TgJ (rjgj T^ 


^92) 


= ^ ^ tr [(f^g-Tg)T(f2g-Tg)] - ^ ^ tr [(Og,-TgjT(f|^^_Tgj] , 


91^92 

91^92 


(23) 
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where A^, = ^ , Xgg' denotes the sum over the gth row (or column) of A. Taking the first-order partial 
derivative of (23) w.r.t. flg^ yields: 

^ rFR" 


dfl 


go 




(24) — •^90 • [2(^30 Tgo) i^go Tgg)oIp] Aggo [2(r2g Tg ) {fig Tg)oIp]. 

g=Xgo 

The first-order partial derivative of ([^ w.r.t. flg^ results in: 


dfl 


^ £({flg}; {Sg}) = ^ ^ ng{ In \ flg \ - tliSgQg )}, 


90 


90 


(25) 


= ng„ [2(f^-l- SgJ - (f2g-'- SgJ O Ig] . 


Subtracting (24) from (25) yields 
(26) 


''50 


(^9o ®3o) •^go»(^9o '^go) 4“ ^ggoi^g "^ 9 ) 

g¥=go 


O (2Jg 1^)5 


which, clearly, is 0 only when Ug^iflg^^- SgJ - Ago,(f2go-TgJ -h Eg/go ^ggoi^g-'^g) = 0- From ([2^ we 
may then find our (conveniently scaled) zero gradient equation to be: 


(27) 


^g'o'- Sso - ^(^go-TgJ + 5] ^(l^g-Tg) = 0. 

«90 g^g^ rlg„ 


Now, rewrite (27) to 
(28) 


^-1 


^90 '*'90(^90 Tgo) — 0 , 


where Sg^ = Sg^ - T,g^go (^9“ ^^g), Tg^ = Tg^, and Xg„ = Xg„,/ngg. It can be seen that ([^ is of the 
form (22). Lemma may then be applied to obtain the solution ( 0 . □ 

Corollary 1. Consider the estimator 0. Let flg{A, {flgi}gi^g^ be the precision matrix estimate of the gth 
class. Also, let diag(A) > 0 and assume that all off-diagonal elements of A are zero. Then Jig (A, {Jlg'jg'^^g) 
reduces to the non-fused ridge estimate of class g: 

-1 


(29) fjg(A, {r2g/}g/^g) — flg { Xgg ) — 


Xa 


Xa 


-^Ip + T Sg - ^Tg 

rig 4 V n 


1/2 


Ao 


D ( Sg - l^^Tg 
^9 


Proof of Corollary^ The result follows directly from equations 0 and 0 by using that J2g'^g^gg' — 
T.g'^g \'g = 0 for all g. □ 

Lemma 4. Let {Tg} € and assume Xgg S K++ in addition to 0 < Xgg> < 00 for all g' g. Then 


'99 

lim 

Aq o ¥ 00 


J 19 (A, {Jlg'lg'^r^g) 


< 00 . 


Proof of Lemma The result is shown through proof by contradiction. Hence, suppose 

lim ^ ||$lg(A, {Og/}g/,^g)||i? 

Agg—>-00 

is unbounded. Let d[-]jj denote the jth largest eigenvalue. Then, as 


fig (a, jUg/jg/^g) g, ” } ^9 (I { ^ 9' i 9'7^9 ) 

9 = 1 


1/2 
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at least one eigenvalue must tend to infinity along with Xgg. Assume without loss of generality that this is 
only the first (and largest) eigenvalue: 


(30) 


lim d 

Xna — 




for some 7 > 0. Now, for any Xgg, the precision can be written as an eigendecomposition: 


(31) 


Clg{A,{ng,}g,^g) =dllVlVj +'^dggVgVj, 

1=2 


where the dependency of the eigenvalues and eigenvectors on the target matrices and penalty parameters 
has been suppressed (for notational brevity and clarity). It is the first summand on the right-hand side 
that dominates the precision for large Xgg. Furthermore, this ridge ML precision estimate of the gth group 
satisfies, by ( [^ , the following gradient equation: 

^ 9(^9 ~ ^ 9 ) ~ ^ggi^g~^g) ~ '^ 9 ' 9 (^ 9 ~"^ 9 ) + ^ 9 ' 9 (^ 9 '~"^ 9 ') = 

g'¥^g g'¥=g 


We now make three observations: (i) From item ^ of Proposition it follows that llg(A, is 


gg ^ ^^++. Consequently, lim;^^ 


- 11^9 {^9'}9'#9) Hi? < 00 ; (ii) The target matrices 

gg- ensure that the norms of fig- can only exceed the norm of fig 


always p.d. for Xgg G 

do not depend on Xgg; and (iii) The finite A 
by a function (independent of Xgg) of the constant Xgg-. Hence, in the limit, the norms of the fig- cannot 
exceed the norm of fig. These observations give that, as Xgg tends towards infinity, the term Agg(ng—Tg) 
will dominate the gradient equation. In fact, the term Xggflg will dominate as, using (30) and (HI]): 


0 ~ -^99(^9 "^9) 

~ -AggdllVlV]'^ -I- AggT 
« -Aj + Wlv7+AggT 
« -Ai+^(viv7+Ag7T) 

« -Ag+^viv7. 


This latter statement is contradictory as it can only be true if the first eigenvalue tends to zero. This, in turn, 
contradicts the assumption of unboundedness (in the Frobenius norm) of the precision estimate. Hence, the 
fused ridge ML precision estimate must be bounded. □ 


Proof of Proposition 

(|^ Note that ([27| for class g may be rewritten to 



implying that ([^ can be obtained under the following alternative updating scheme to 

Sg = Sg, Tg=Tg+^ ^(J^g,-Tg0, and Ag=^. 

g,^g y y 

Now, let d[ • ]jj denote the jth largest eigenvalue. Then 






d{[f 8 g]-l} =d 

^33 

^(S9-7Tg) 

.. + \ 
33 \ 

{.[ 


A9Tg) 


+ Ao > 0, 


2 . 


when Ag > 0. As Ag = Thg'i^g'gl'^g) ^g'g be 0 for all g' g, fig is guaranteed to be p.d. 

whenever Xgg G R++. 
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(|^ Note that ^gg' = ^g '9 = 0 implies that tig reduces to the non-fused class estimate (29) 

by way of Corollary!^ The stated right-hand limit is then immediate by using Xgg = 0 in (29). Under the 
distributional assumptions this limit exists with probability 1 when p < Ug. 


( [m| ) Consider the zero gradient equation (27) for the gth class. Multiply it by Ug/Xg, to factor out the 
dominant term: 


(32) 


When A 


Agm Ag. 


Xg'g 


^ -fJ>{ng,-Tg.)=Q. 

i-L '^g* 

g ¥=g 


gg 


X) , Ag, = J2g' ^gg' oo ) implying that the hrst two terms of (32) vanish. Under the 
assumption that Xgg> < oo for all g' ^ g we have that Ag/g/Ag, —)■ 0 when Xgg —>■ oo“ for all g' ^ g. Thus, all 
terms of the sum also vanish as Lemma [^implies that the Ug/ are all bounded. Hence, when Xgg —)■ oo“ and 
Xgg! < 00 for all g' ^ g, the zero gradient equation reduces to 0,g — Tg = 0, implying the stated left-hand 
limit. 


The proof strategy follows the proof of item|^ Multiply the zero gradient equation (27) for the gith 
to obtain: 


class with rigj /A, 
(33) 


9i92 

n 


9i 


Xn 


. -1 7 

-ft - 

91 \ 

A. 


9i 


5g,-^(ng,-TgJ 

'^9192 


E 

9'#9l 


^ing,-Tg,) = 0. 


The first two terms are immediately seen to vanish when Xg-^g^ 
penalties except A, 


gjgj are finite, we have that Xg^,/Xg-^g^ 1 for Ag^g^ 


sum term in (33) vanish except the element where g’ = g 2 - Hence, when A 


for all { 51 , 52 } 7^ { 91 ^ 92 }, the zero gradient equation for class 51 reduces to: 
(34) — (rjgj — Tgj^) -|- (rigj — Tg^) = 0. 


00 “. Under the assumption that all 
00 “. Similarly, all elements of the 
00 “ and when Ag/g' < 00 


9192 


Conversely, by multiplying the zero gradient equation (27) for the 52 th class with n-g^/Ag^g^ one obtains, 


through the same development as above, that the zero gradient equation for class 52 reduces to the rjg^- 
analogy of equation (34). The result (IMl) then immediately implies the stated limiting result. □ 


Corollary 2. Consider item [w| of Proposition [1} When, in addition, Tgj^ = Tg^, we have that 


lim (Ogj Tgj) — 


lim 

>00- 


(^92 Tgj) 


^9l “ ^92 • 


Proof of Corollary^ The implication follows directly by using Tg^ = Tg^ in (34). 


□ 


Proof of Proposition The result follows directly from Proposition and Lemma 


□ 


Proof of Proposition^ Note that line|^of Algorithmic implies that the initializing estimates are p.d. More¬ 
over, regardless of the value of the fused penalties (in the feasible domain), the estimate in linej^of Algorithm 
ICis p.d. as a consequence of Proposition □ 
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SUPPLEMENTARY MATERIAL 


1. Alternative fused ridge solutions 

This section derives two equivalent (in terms of Q) alternative updating schemes to The motivation 
for the exploration of these alternative recursive estimators is twofold. First, alternative recursions can 
exhibit differing numerical (in)stability for extreme values of the penalty matrix A = [Agjg 2 ]' Second, they 
provide additional intuition and understanding of the targeted fused ridge estimator. 

The general strategy to finding the alternatives is to rewrite the gradient equation ( |27| ) into the non-fused 
form (28), which we will repeat here: 

(SI) 


-1 


^So "^90 (^90 ~ ^ 


where Ag^, Tg^,, and Sgg do not depend on Fig^. Note that an explicit closed-form solution to (SI I exists in 
the form of 0- 


1.1. First alternative. The first alternative scheme is straightforward. Rewrite (27) to: 

(^2) 0 = UggClg^ — UggSgg — Xggt(tlgg — Tgg) + Y! '^990 (^9 ~ "^ 9 ) 

9/90 


— Uggflgg UggSgg Xgg, ^ ^ g g 


X 


T90+ E A^(f^ 9 -Tg) 

^90 • 


97^90 


where Xgg, = J2g ^ggo- terms of (SI I, we thus have the updating scheme given in equation ([^. As stated 
in the main text, it has the intuitive interpretation that a fused class target is used which is a combination 
of the class-specific target and the ‘target corrected’ estimates of remaining classes. 

1.2. Second altern ativ e. We now derive a second alternative recursion scheme. Add and subtract 
^go* J2g^gg "^ 990 ^ 9 ! I and rewrite such that: 

= ^90^90 “ ^90^90 “ AgQ.(rJ 9 Q— Tgg) -|- Agg. y ] XgggClg + ^ ( Xggg{Xlg— Tg) — Xgg, ^ ( AgggJlg 


— Uggflgg UggSgg Xgg, 


— UggClgg UggSgg Xgg, 


97^90 

^90“ ( '^90 T y ] Agggilt 


95^90 


97^90 


97^90 


^90 ( '^90 T y ] Agggilt 

97^90 


+ y ] Agggflg ^ ( Aggg Tg Agg. ^ ( AgggfJg 
9/90 97^90 97^90 

“ y ] AgggTg — (Agg. —l) ^ ( Xgggflg 

9/90 9/90 


flggSlgO Ugg 


Dividing by Ugg gives 


E ''‘990 ^9 + E '^9 


97^90 


/ Ugg 
97^90 


Xgn, 


^90 ( "^90 “f y ] Aggg fit 

9/90 


0 = ^go- 


c i ''90 
^90 


‘ y: x„,n, + y: (fiT. 


gX=go 


g^go 


Ago* 


^90 f Tgg -l- E! ^ggo^i 
^ g^go 


which brings the expression to the desired form (SI) with the updating scheme 
Agn*~l 


Sgo = Sgg 


-- EMof^g+E^Tg, Tgo=Tgg+EMof^9. aiid To = ^- 

g^go g^go g^go 


Again, a solution for flgg with fixed fig for all g go, is available through Lemma E [42] and is given in (j^. 
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1.3. Motivation. Though seemingly more complicated, these alternative updating schemes can be numer¬ 
ically more stable for extreme penalties. In both alternatives, we see that is p.s.d. for (nearly) all very 


large and very small penalties. Likewise, is always positive definite. Compare the alternative expressions 
to the updating scheme given by ([^ which can be seen to be numerically unstable for very large penalties: 
For very large \gg or the Sg^ in ([^ may be a matrix with numerically extreme values. This implies 

ill-conditioning and numerical instability under finite computer precision. On the other hand, ‘updating’ the 
target matrix will generally lead to updates for which the resulting estimator is not rotationally equivariant. 
This implies a reduction in computational speed. 


2. Estimation in special cases 

Here we explore scenarios for which we arrive at explicit targeted fused ridge estimators. These explicit 
solutions further insight into the behavior of the general estimator and they can provide computational 
speed-ups in certain situations. Three special cases are covered: 

I. Agg- = 0 for all g ^ g' ov equivalently Y.g' ^gg’ = ^g» = \g ^r all g-, 

II. Jli = • • • = Q,c and Tg = T for all 5 ; 

III. Tg = T for all g, Agg = A for all g, Xg^g^ = A/ for all gi ^ g 2 , and A/ —>■ 00 “. 

2.1. Special case I. When Agg/ = Ag, = Agg for all g, we have that Sg'^^g ^gg' = '^g'^g ^g'g ~ ® 
g. Hence, all fusion penalties are zero. The zero gradient equation (271 for class g then no longer hinges upon 


information from the remaining classes gX The targeted fused precision estimate for class g then reduces 
to (291 of Corollary This case thus coincides, as expected, with obtaining G decoupled non-fused ridge 
precision estimates. A special case that results in the same estimates occurs when considering Xg^g^ = A/ 
for all gi ^ g 2 and A/ is taken to be 0. 

2.2. Special case II. Suppose Og = ft and Tg = T for all g. Consequently, the fusion penalty term 
vanishes irrespective of the values of the Ag^gj, gi ^ 32 - The zero gradient equation (271 then reduces to 

0 = - rigSg - Xggi^ - T), 

for each class g. Adding all G equations implies: 

G a / G \ 

0 = ^ Hgft-^ - ngSg - XI ^ 

3=1 3=1 \ 3=1 

= — n,S, — tr(A)(II — T) 

tr(A) 


33 


(O-T) 


(S3) 


= fl~^ - 


s. - tAlx 


n. 


ft. 


n. 


We recognize that (S31 is of the form (22). Lemmamay then be directly applied to obtain the solution: 


(S4) 


fl(A) = 


A*Ip + -(S.-A*T)2 


1/2 


.(S.-A*T) 


where A* = tr(A)/n,. Hence, this second special case gives a non-fused penalized estimate that uses the 
pooled covariance matrix. It can be interpreted as an averaged penalized estimator. It is of importance in 


testing equality of the class precision matrices (see Section 4.1 of the main text). 


2.3. Special case III. Suppose that Tg = T for all g, that Agg = A for all g, and that Xg^^g^ = A/ for all 
9i 7^ 52 ■ The main optimization problem then reduces to (§. Clearly, for A/ —)■ 00 the fused penalty 


A 


/™({Og};A,A;,T) = -Xll^9-T 


22i^„-3 

3 31 >32 

is minimized when Hi = H 2 = ••• = flo- This is also implied, more rigorously, by Corollary Hence, 
the problem reduces to the special case of section 2.2 considered above. The solution to the penalized ML 


problem when A/ = 00 is then given by (S4) where tr(A) now implies GX. 
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3. Fused Kullback-Leibler approximate cross-validation 

3.1. Motivation. In £i-penalized estimation of the precision matrix, penalty selection implies (graphical) 
model selection: Regularization results in automatic selection of conditional dependencies. One then seeks 
to select an optimal value for the penalty parameter in terms of model selection consistency. To this end, the 
Bayesian information criterion (BIG), the extended BIG (EBIG), and the stability approach to regularization 
selection (StARS) are appropriate [IS]. The (fused) ^ 2 -penalty will not directly induce sparsity in precision 
matrix estimates. Hence, in £ 2 -penalized problems it is natural to choose the penalty parameters on the basis 
of efficiency loss. Of interest are then estimators of the Kullback-Leibler (KL) divergence, such as LOOGV, 
generalized approximate cross-validation (GAGV), and Akaike’s information criterion (AIG). While superior 
in terms of predictive accuracy due to its data-driven nature, the LOOGV is computationally very expensive. 
Vujacic et al. [13] proposed a KL-based GV loss with superior performance to both AIG and GAGV. The 
proposed method has closed-form solutions and thus provides a fast approximation to LOOGV. Here, we 
extend this method to provide a computationally friendly approximation of the fused LOOGV score. 


3.2. Formulation. Following Vujacic et al. [23], we now restate the KL approximation to LOOGV in the 
fused ridge setting. Let the true precision matrix for class g be denoted by fig. Its estimate, shorthanded by 
Ug can be obtained through Algorithm The KL divergence between the multivariate normal distributions 
A/’p(0, ^) and A/’p(0, ^) can be shown to be: 

KL(r 2 g, J7g) =-|tr(llg ^g) — Injfig llg|— 

For each g we wish to minimize this divergence. In the fused case we therefore consider the fused Kullback- 
Leibler (FKL) divergence which, motivated by the LOOGV score, is taken to be a weighted average of KL 
divergences: 


FKL({f2g},{ng}) 
G 


(S5) 


= - ^ Ug KL(flg, Og) = - ^ ^{^(n^lng) - Inlf^^lfigl - 
Ti^ Tl^ Z V J 

5=1 5=1 

The FKL divergence (S51 can, using the likelihood 0 , be rewritten as 

1 . 1 ^ . 

FKL =-£({$lg}; {Sg}) -I- bias, where bias = -—^ Ug tr[fig(r2g ^ — S 


9=1 


and where the equality holds up to the addition of a constant. It is clear that the bias term depends on 
the unknown true precision matrices and thus needs to be estimated. The fused analogue to the proposal of 
Vujaciic et al. |43| . called the fused Kullback-Leibler approximate cross-validation score or simply approximate 
fused LOOCV score, then is 


(5 6 ) 

with 

(57) 


1 


FKL(A) =- /:({f^g}; {Sg}) + bias. 


Or "-g 

bias = ^ - ^g)y^g + Agy7g(^g ” 


9=1 1=1 


and where Ag = The derivation of this estimate is given in Section 
A* such that the FKL approximate cross-validation score is minimized: 


B.2 


below. One would then choose 


(S 8 ) 


A* = argminFKL(A), subject to: A > 0 A diag(A) > 0. 

A 


The closed form expression in (S6) implies that A* is more rapidly determined than A*. As seen in the 
derivation. A* « A* for large sample sizes. 
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3.3. Derivation. Here we give, borrowing some ideas from Vujacic et al. [23], the derivation of the estimate 
(S6). Let observation i in class g be denoted by y^g and let S = Sig = Yigyjg be the sample covariance or 
scatter matrix of that observation. As before, the singularly indexed Sg = ^ class-specific 

sample covariance matrix. Throughout this section we will conveniently drop (some of) the explicit notation. 

The FKL divergence reframes the LOOCV score in terms of a likelihood evaluation and a bias term when 
S is not left out of class g. We thus study the change in the estimate as function of the single scatter matrix 
S. Let Dg(S) = Dg*® be the estimate in class g when S is omitted. That is, Dg(S) is part of the solution 
to the system 


(S9) 


+ l[a=g]S -f -f Aa = 0, for all O = 1, . . . , G, 

b^a 


where g-aa = and where is a matrix determined by the remaining data, penalty 

parameters and targets. Note that the penalized MLE can be denoted fig = r2g(0), which corresponds to 
the ‘full’ estimate resulting from the full gradient equation (27). 

We wish to approximate Dg(S) by a Taylor expansion around $lg(0), i.e.: 

Bfl 


Differentiating (S9) w.r.t. Sjji, the {j,j')th entry in S, and equating to zero yields 


0 = 


a 1 dfla - r 1 .JJ, dflf) 


dS. 


03 


b^a 


(SIO) 


= -ft 


.1 dfl, 




~^a ^ 


dflb 


+ l[a=g]F,gj,, for all j,f, 


where Ejj' is the null matrix except for unity in entries and The third term is obtained as 

dS/dSjji = Egg/ by the symmetric structure of S. This is also seen from the fact that S = 


Let 


V(S)« = E#7V. 


jj' 


and multiply (SIO) by Sjj> and sum over all j^j' to obtain 

(Sll) fl-^V{S)afl-^-Y,t^abV{S)b = t[a=g]S, for all a = l,...,G. 


We seek the solution vector V = {V(S)a}Q=i of square matrices for the system of equations in (Sll) which 
can be rewritten in the following way. Introduce and consider the linear operator (or block matrix): 


N = where Nab = 


^ ^ Ip if ^ ^ 

Mafolp ^ Ip if d ^ b 


Then V can be verihed to be the solution to the system (SIOI as 


N(V)a = ^N,bV(S), = 0 for a ^ g, and 

b 

N(V)g = ^NgbV(S)6 = S for a = g. 


Hence we need to invert N to solve for V. The structure of N is relatively simple, but there seems to be no 
(if any) simple inverse. Note that N = D — M is the difference of a (block) diagonal matrix D and a matrix 
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M depending on the ^’s: 


^ Ip- 


In terms of the /i’s we obtain to first order that 

N-i = (D - M)-i « T> ^ 


yielding the approximation 


(S12) 


(igiS) « ^ g (g) ^ll) (S) 

— Clg + CtgS^lg + fj, g g^l gSi}, g , 


where Slg = r2(0). To a first order in fj,gg this is the same as the approximation 

Og(S) « Og + (J7g 1 (g) tlg^ - Hgglp (g) Ip ) ^ ^ ( S ) . 

We also need an approximation for ln|r2g(S)|. By first-order Taylor expansion around S = 0 we have 


(S13) 


ln|Og(S)|«ln|ng(0)|+^tr[Og-i(0) 




^n|$7g(0)| tr tlg^{hg ®tlg+ ^Iggtll 0 I72)(S) 
= ln|ng(0)| -l-tr(SOg + HggflStll) , 


where we have used that ^ ln|A(t)| = tr[A(t) and « (J7g 0 ^g -|- fJ-gg^^ 0 Jlg)(Egj/). We now 

have the necessary equations to derive the FKL approximate cross-validation score. 

Define 


(S14) /(A,B) =ln|B|-tr(BA) 

by which the identity 


(5 15 ) ^/(S.g,fIg)=ng/(Sg,ng) 

holds for all g. The full likelihood ([^ in terms of / is given by 

G G 

(516) C{{ng}-{Sg}) CX ^ l|{ln|J^g| - tr(f2gSg)} = ^ ^/(Sg,flg), 


9=1 


9=1 


while the likelihood of a single Sig is 

Sig) CX 2 l^9l ~ tr(OgSig)| = -/(S^g, Hg). 


(S17) 
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In our setting, the fused LOOCV score is given by: 

. G rig 

LOOCV = - - E E (^9 

• 9=1i=l 

• 9=1i=l 

1 ° 1 


- E 2 E - /(S. 9 ,09)] 

* 9=1 *=i 

G G 

^ E - rT ^ ^ 

* 9 = 1 * 9 = 1 i = l 

G ‘^g 

& ^£({^9}; {S9}) - ^ E E 09)] 

* • g=l 2=1 

I . G ‘Tig 

^ £({^9}; {S9}) - ^ E E - tr(nES.9) - ^n\Clg\ + tr(n9S,9)] . 


<53 J- 

n 


9=1i=i 


Now, substitution of (S12) and (S13) gives the FKL approximate cross-validation score as an approximation 

G rig 


to the fused LOOCV score: 


where 


(S18) 


LOOCV « FKL = -^£({^ 9 }; {SJ) + ^ E E ^ 19 ’ 
• • 9=1 i=l 


Cig = tr(nsn -I- — tr(SJ7 -t- flggClSCl^) 

= tr(f2Sr2) -I- ^gg tr(n^Sil^) — tr(Sll) — fj,gg tr(OSi7^) 
= tr(Sll2) + ggg tr(Sn4) - tr(SO) - figg tr(Sfi3) 

= tr[S(f22-fj)] + figg tr[S{n'^ - Cl^)] 

= yjgi^^ - ^)yig + fj-ggyjgi^'^ - ^^)yig- 


To arrive at (S18) we have used the linear and cyclic properties of the trace operator. As S = VigY^g, the 
cyclic property implies the final equality since tr(SA) = tr(yjgy^A) = tr(yjAyjg) = yjAy^^. Equation 
(S18) is equivalent to the summand in (S7). 







